32 research outputs found
On the complementary nature of ANOVA simultaneous component analysis (ASCA+) and Tucker3 tensor decompositions on designed multi-way datasets
The complementary nature of analysis of variance (ANOVA) Simultaneous
Component Analysis (ASCA+) and Tucker3 tensor decompositions is demonstrated
on designed datasets. We show how ASCA+ can be used to (a) identify
statistically sufficient Tucker3 models; (b) identify statistically important triads
making their interpretation easier; and (c) eliminate non-significant triads
making visualization and interpretation simpler. For multivariate datasets
with an experimental design of at least two factors, the data matrix can be
folded into a multi-way tensor. ASCA+ can be used on the unfolded matrix,
and Tucker3 modeling can be used on the folded matrix (tensor). Two novel
strategies are reported to determine the statistical significance of Tucker3
models using a previously published dataset. A statistically sufficient model
was created by adding factors to the Tucker3 model in a stepwise manner until
no ASCA+ detectable structure was observed in the residuals. Bootstrap analysis
of the Tucker3 model residuals was used to determine confidence intervals
for the loadings and the individual elements of the core matrix and showed
that 21 out of 63 core values of the 3 × 7 × 3 model were not significant at the
95% confidence level. Exploiting the mutual orthogonality of the 63 triads of
the Tucker3 model, these 21 factors (triads) were removed from the model. An
ASCA+ backward elimination strategy is reported to further simplify the
Tucker3 3 × 7 × 3 model to 36 core values and associated triads. ASCA+ was
also used to identify individual factors (triads) with selective responses on
experimental factors A, B, or interactions, A × B, for improved model visualization
and interpretation.
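The folding step mentioned in the abstract can be illustrated with a minimal sketch. It assumes a full two-factor design with one observation per cell and rows sorted so that factor B varies fastest; the function names `fold` and `unfold`, the sizes, and the random data are hypothetical and are not taken from the paper.

```python
import numpy as np

def fold(X, n_a, n_b):
    """Fold an (n_a * n_b, n_vars) matrix into an (n_a, n_b, n_vars) tensor.

    Assumption: rows are ordered with factor B varying fastest.
    """
    n_rows, n_vars = X.shape
    if n_rows != n_a * n_b:
        raise ValueError("number of rows must equal n_a * n_b")
    return X.reshape(n_a, n_b, n_vars)

def unfold(T):
    """Inverse of fold: return the (n_a * n_b, n_vars) matrix."""
    n_a, n_b, n_vars = T.shape
    return T.reshape(n_a * n_b, n_vars)

# Hypothetical design: 3 levels of A, 4 levels of B, 5 measured variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(3 * 4, 5))
T = fold(X, 3, 4)

# A Tucker3 core of size 3 x 7 x 3 has one value per triad.
n_triads = int(np.prod((3, 7, 3)))
```

In this layout an ASCA-type method would work on `X`, while a Tucker3 model would be fitted to `T`; the core size in the last line matches the 63 triads quoted in the abstract.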
Cross-validation in PCA models with the element-wise k-fold (ekf) algorithm: theoretical aspects
Cross-validation has become one of the principal methods to adjust the meta-parameters in predictive models.
Extensions of the cross-validation idea have been proposed to select the number of components in principal
components analysis (PCA). The element-wise k-fold (ekf) cross-validation is among the most used algorithms for
principal components analysis cross-validation. This is the method programmed in the PLS_Toolbox, and it has been
stated to outperform other methods under most circumstances in a numerical experiment. The ekf algorithm is
based on missing data imputation, and it can be programmed using any method for this purpose. In this paper,
the ekf algorithm with the simplest missing data imputation method, trimmed score imputation, is analyzed. A
theoretical study is conducted to identify in which situations the application of ekf is adequate and, more importantly,
in which situations it is not. The results presented show that the ekf method may be unable to assess the extent to
which a model represents a test set and may lead to discarding principal components with important information. In a
second paper of this series, other imputation methods are studied within the ekf algorithm.
Research in this area is partially supported by the Spanish Ministry of Economy and Competitiveness and FEDER funds from the European Union through grant DPI2011-28112-C04-02. Jose Camacho was funded by the Juan de la Cierva program, Ministry of Science and Innovation, Spain. This study was carried out when Jose Camacho was at the Universidad Politecnica de Valencia and Universitat de Girona, Spain.
Camacho Páez, J.; Ferrer Riquelme, AJ. (2012). Cross-validation in PCA models with the element-wise k-fold (ekf) algorithm: theoretical aspects. Journal of Chemometrics. 26(1):361-373. https://doi.org/10.1002/cem.2440
On the Generation of Random Multivariate Data
The simulation of multivariate data is often necessary for assessing the performance
of multivariate analysis techniques. The random generation of multivariate data
when the covariance matrix is completely or partly specified is solved by different
methods, from the Cholesky decomposition to some recent alternatives. However,
many times the covariance matrix has to be generated also at random, so that
the data simulation spans different situations from highly correlated to uncorrelated data. This is the case when assessing a new multivariate analysis technique
in Monte Carlo experiments. In this paper, we introduce a new algorithm for the
generation of random data from covariance matrices of random structure, where
the user only decides the data dimension and the level of correlation. We will illustrate the application of this algorithm in several relevant problems in multivariate
analysis, namely the selection of the number of Principal Components in Principal Component Analysis, the evaluation of the performance of sparse Partial Least
Squares and the calibration of Multivariate Statistical Process Control systems. The
algorithm is available as part of the MEDA Toolbox v1.1.
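For the case where the covariance matrix is fully specified, the Cholesky route mentioned in the abstract can be sketched as follows. This is only the classical baseline, not the new algorithm the paper introduces; the function name `simulate`, the example covariance, and the sample size are illustrative assumptions.

```python
import numpy as np

def simulate(cov, n_obs, rng):
    """Draw n_obs rows from a zero-mean normal with covariance cov.

    Uses the Cholesky factor L (cov = L @ L.T), so cov must be
    symmetric positive definite.
    """
    L = np.linalg.cholesky(cov)
    Z = rng.standard_normal((n_obs, cov.shape[0]))  # uncorrelated draws
    return Z @ L.T                                   # rows now have covariance cov

# Hypothetical example: two variables with correlation 0.8.
rng = np.random.default_rng(42)
cov = np.array([[1.0, 0.8],
                [0.8, 1.0]])
X = simulate(cov, 20000, rng)
sample_cov = np.cov(X, rowvar=False)
```

With a large sample, `sample_cov` should be close to `cov`; generating the covariance matrix itself at random, which is the paper's subject, is not covered by this sketch.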
Variable-selection ANOVA Simultaneous Component Analysis (VASCA)
Motivation: ANOVA Simultaneous Component Analysis (ASCA) is a popular method for the analysis of multivariate
data yielded by designed experiments. Meaningful associations between factors/interactions of the experimental
design and measured variables in the dataset are typically identified via significance testing, with permutation tests
being the standard go-to choice. However, in settings with large numbers of variables, like omics (genomics,
transcriptomics, proteomics and metabolomics) experiments, the ‘holistic’ testing approach of ASCA (all variables
considered) often overlooks statistically significant effects encoded by only a few variables (biomarkers).
Results: We hereby propose Variable-selection ASCA (VASCA), a method that generalizes ASCA through variable
selection, augmenting its statistical power without inflating the Type-I error risk. The method is evaluated with
simulations and with a real dataset from a multi-omic clinical experiment. We show that VASCA is more powerful
than both ASCA and the widely adopted false discovery rate controlling procedure; the latter is used as a benchmark
for variable selection based on multiple significance testing. We further illustrate the usefulness of VASCA for
exploratory data analysis in comparison to the popular partial least squares discriminant analysis method and its
sparse counterpart.
Agencia Andaluza del Conocimiento, Regional Government of Andalucia, in Spain; European Commission B-TIC-136-UGR20; State Research Agency (AEI) of Spain; European Social Fund (ESF) RYC2020-030536-I; AEI PID2020-118139RB-I0
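The permutation testing that the abstract describes as the standard choice in ASCA can be sketched for a single factor as below. This is a simplified illustration of a holistic (all-variables) test, not the VASCA procedure; the function names, the simulated data, and the number of permutations are assumptions made for the example.

```python
import numpy as np

def effect_ss(X, labels):
    """Sum of squares of the factor effect matrix (level means of centered data)."""
    Xc = X - X.mean(axis=0)
    ss = 0.0
    for level in np.unique(labels):
        rows = labels == level
        ss += rows.sum() * np.sum(Xc[rows].mean(axis=0) ** 2)
    return ss

def permutation_pvalue(X, labels, n_perm=500, seed=0):
    """P-value of the factor effect, obtained by permuting the level labels."""
    rng = np.random.default_rng(seed)
    observed = effect_ss(X, labels)
    count = 0
    for _ in range(n_perm):
        if effect_ss(X, rng.permutation(labels)) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)

# Hypothetical data: 2 levels, 20 observations each, 50 variables.
rng = np.random.default_rng(1)
labels = np.repeat([0, 1], 20)
X = rng.normal(size=(40, 50))
X_effect = X.copy()
X_effect[labels == 1, :3] += 3.0   # strong effect carried by 3 of the 50 variables

p_null = permutation_pvalue(X, labels)
p_effect = permutation_pvalue(X_effect, labels)
```

The test statistic pools all 50 variables, which is the holistic behaviour the abstract refers to; the effect here is deliberately large so that it is detected even though only three variables carry it.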
Group-wise Partial Least Square Regression
This paper introduces the Group-wise Partial Least Squares (GPLS) regression.
GPLS is a new Sparse PLS (SPLS) technique where the sparsity structure is
defined in terms of groups of correlated variables, similarly to what is done in
the related Group-wise Principal Component Analysis (GPCA). These groups
are found in correlation maps derived from the data to be analyzed. GPLS is
especially useful for exploratory data analysis, since suitable values for its metaparameters can be inferred upon visualization of the correlation maps. Following
this approach, we show GPLS solves an inherent problem of SPLS: its tendency
to confound the data structure as a result of setting its metaparameters using
standard approaches for optimizing prediction, like cross-validation. Results are
shown for both simulated and experimental data.
On the use of the observation-wise k-fold operation in PCA cross-validation
Cross-validation (CV) is a common approach for determining the optimal number of components in a principal component analysis model. To guarantee the
independence between model testing and calibration, the observation-wise k-fold
operation is commonly implemented in each cross-validation step. This operation renders the CV algorithm computationally intensive, and it is the main
limitation to applying CV to very large data sets. In this paper we carry out an
empirical and theoretical investigation of the use of this operation in the element-wise
k-fold (ekf) algorithm, the state-of-the-art CV algorithm. We show that
when very large data sets need to be cross-validated and the computational time
is a matter of concern, the observation-wise k-fold operation can be skipped.
The theoretical properties of the resulting modified algorithm, referred to as
the column-wise k-fold (ckf) algorithm, are derived. Also, its performance is evaluated with several artificial and real data sets. We suggest the ckf algorithm
as a valid alternative to the standard ekf to reduce the computational time
needed to cross-validate a data set.
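A rough sketch of the column-wise idea is given below, assuming trimmed score imputation (left-out columns of the mean-centered data set to zero before computing scores) and a single PCA fitted on all observations. The function name `ckf_press`, the contiguous fold assignment, and the simulated data are hypothetical; the published algorithm may differ in its details.

```python
import numpy as np

def ckf_press(X, max_pcs, n_folds=5):
    """Column-wise k-fold PRESS for PCA models with 0..max_pcs components."""
    Xc = X - X.mean(axis=0)
    # One PCA on all observations: no observation-wise split is performed.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    folds = np.array_split(np.arange(Xc.shape[1]), n_folds)
    press = np.zeros(max_pcs + 1)
    for a in range(max_pcs + 1):
        P = Vt[:a].T                       # loadings, shape (n_vars, a)
        for cols in folds:
            X_trim = Xc.copy()
            X_trim[:, cols] = 0.0          # trimmed score imputation of left-out columns
            scores = X_trim @ P            # scores from the remaining columns only
            X_hat = scores @ P.T
            press[a] += np.sum((Xc[:, cols] - X_hat[:, cols]) ** 2)
    return press

# Hypothetical data: two latent components plus noise in 10 variables.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
loadings = rng.normal(size=(2, 10))
X = latent @ loadings + 0.1 * rng.normal(size=(200, 10))
press = ckf_press(X, max_pcs=5)
```

Here `press[0]` equals the total sum of squares of the centered data, and the curve over the number of components would be inspected to choose the model size. Skipping the observation-wise split means the loadings are computed only once per data set, which is where the time saving comes from.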
Present and Future of Network Security Monitoring
Network Security Monitoring (NSM) is a popular term to refer to the detection of security incidents by monitoring the network events. An NSM system is central for the security of current networks, given the escalation in sophistication of cyberwarfare. In this paper, we review the state-of-the-art in NSM, and derive a new taxonomy of the functionalities and modules in an NSM system. This taxonomy is useful to assess current NSM deployments and tools for both researchers and practitioners. We organize a list of popular tools according to this new taxonomy, and identify challenges in the application of NSM in modern network deployments, like Software Defined Network (SDN) and Internet of Things (IoT).
This work was funded by the Ministry of Science and Innovation through the Ayudas Cervera para Centros Tecnologicos grant of the Spanish Centre for the Development of Industrial Technology (CDTI) through the Project EGIDA under Grant CER-20191012, and in part by the Spanish Ministry of Economy and Competitiveness and European Regional Development Fund (ERDF) funds under Project TIN2017-83494-R.
Monitoring and selection of network security incidents using EDA [Monitorización y selección de incidentes en seguridad de redes mediante EDA]
One of the greatest challenges facing network security monitoring systems is the large volume of data, of diverse nature and relevance, that they must process for adequate presentation to the system administration team, while trying to incorporate the most relevant semantic information. This paper proposes applying tools derived from exploratory data analysis techniques to select the critical events on which the administrator should focus. Additionally, these tools can provide semantic information about the elements involved and their degree of involvement in the selected events. The proposal is presented and evaluated using the VAST 2012 challenge as a case study, with highly satisfactory results.
This work has been partially funded by MICINN through project TEC2011-22579.
Group-Wise Principal Component Analysis for Exploratory Intrusion Detection
Intrusion detection is a relevant layer of cybersecurity to prevent hacking and illegal activities
from happening on the assets of corporations. Anomaly-based Intrusion Detection Systems perform an
unsupervised analysis on data collected from the network and end systems, in order to identify singular
events. While this approach may produce many false alarms, it is also capable of identifying new (zero-day)
security threats. In this context, the use of multivariate approaches such as Principal Component
Analysis (PCA) provided promising results in the past. PCA can be used in exploratory mode or in learning
mode. Here, we propose an exploratory intrusion detection approach that replaces PCA with Group-wise PCA
(GPCA), a recently proposed data analysis technique with additional exploratory characteristics. A main
advantage of GPCA over PCA is that the former yields simple models, easy to understand by security
professionals not trained in multivariate tools. Besides, the workflow in the intrusion detection with GPCA
is more coherent with dominant strategies in intrusion detection. We illustrate the application of GPCA in
two case studies.
This work was supported in part by the Spanish Government-MINECO (Ministerio de Economía y Competitividad), using the Fondo Europeo de Desarrollo Regional (FEDER), under Projects TIN2014-60346-R and TIN2017-83494-R.
Evaluation of Diagnosis Methods in PCA-based Multivariate Statistical Process Control
Multivariate Statistical Process Control (MSPC) based on Principal Component
Analysis (PCA) is a well-known methodology in chemometrics that is aimed at testing whether an industrial process is under Normal Operation Conditions (NOC).
As a part of the methodology, once an anomalous behaviour is detected, the root
causes need to be diagnosed to troubleshoot the problem and/or avoid it in the
future. While there have been a number of developments in diagnosis in the past
decades, no sound method for comparing existing approaches has been proposed.
In this paper, we propose such a procedure and use it to compare several diagnosis
methods using randomly simulated data and data from realistic sources. This is a
general comparative approach that takes into account factors that have not previously been considered in the literature. The results show that univariate diagnosis
is more reliable than its multivariate counterpart.